The graphs below use CalEnviroScreen 4.0 data, which refer to the year 2020.
There are 21 indicators in total, including Asthma and PM2.5, which we will look at closely. “Each indicator is assigned a score for each census tract in the state based on the most up-to-date suitable data. Scores are weighted and added together within the two groups to derive a pollution burden score and a population characteristics score. Those scores are multiplied to give the final CalEnviroScreen score.”
Asthma Indicator: spatially modeled, age-adjusted rate of emergency department (ED) visits for asthma per 10,000
PM2.5 Indicator: annual mean concentration of PM2.5 in µg/m³ (weighted average of measured monitor concentrations and satellite observations)
The first map displays PM2.5 concentration, a measure of air quality, and the second map does the same for the Asthma indicator throughout the Bay Area.
This scatter plot shows the relationship between the Asthma indicator and the PM2.5 concentration. Although the best-fit line has a positive slope, it does not appear to be a good predictor of the data, especially for PM2.5 values between 8 and 9. I therefore expect the residuals to be larger than we would like.
Here is a summary of the model:
##
## Call:
## lm(formula = Asthma ~ PM2.5, data = bay_asthma_pm_tract)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -54.47 -25.89  -9.61  12.94 182.95
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -116.278     13.040  -8.917   <2e-16 ***
## PM2.5         19.862      1.534  12.950   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.49 on 1578 degrees of freedom
## Multiple R-squared: 0.09606, Adjusted R-squared: 0.09549
## F-statistic: 167.7 on 1 and 1578 DF, p-value: < 2.2e-16
As you can see, a one-unit increase in PM2.5 (1 µg/m³) is associated with an increase of 19.862 in the Asthma indicator, and 9.6% of the variation in Asthma is explained by the variation in PM2.5.
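The slope and the R-squared quoted above can be read directly off a fitted `lm` object. The following is a minimal sketch on simulated data, since `bay_asthma_pm_tract` is not reproduced here. The data frame `sim` and the numbers used to generate it are assumptions made purely for illustration and are not CalEnviroScreen values.

```r
# Simulated stand-in for bay_asthma_pm_tract (hypothetical values, not real data)
set.seed(1)
sim <- data.frame(PM2.5 = runif(200, min = 6, max = 10))
sim$Asthma <- -116 + 20 * sim$PM2.5 + rnorm(200, sd = 37)

# Same formula as in the summary above
model <- lm(Asthma ~ PM2.5, data = sim)

# Slope: change in Asthma associated with a one-unit increase in PM2.5
slope <- coef(model)[["PM2.5"]]

# R-squared: share of the variation in Asthma (the response) explained by PM2.5
r_squared <- summary(model)$r.squared

slope
r_squared
```

Note that R-squared always describes the response on the left-hand side of the formula, which is why it is the variation in Asthma, not in PM2.5, that is being explained.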
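From this plot, you can observe that the highest density of residuals is around -20 rather than 0. Together with the median of -9.61 and the maximum of 182.95 in the summary, this shows that the residuals are skewed to the right and that the line of best fit describes the data poorly.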
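A residual density plot like the one described here can be produced with base R. The sketch below again uses simulated data (the data frame `sim` and its parameters are illustrative assumptions, not the real tract data), generated so that the response is right-skewed. With an intercept in the model the residuals always average to zero, so a peak below zero is a sign of skew rather than of a shifted mean.

```r
# Simulated right-skewed response (hypothetical values, not real data)
set.seed(1)
sim <- data.frame(PM2.5 = runif(200, min = 6, max = 10))
sim$Asthma <- exp(0.7 + 0.36 * sim$PM2.5 + rnorm(200, sd = 0.65))

model <- lm(Asthma ~ PM2.5, data = sim)
res <- residuals(model)

# Density of the residuals, with a dashed reference line at zero
plot(density(res), main = "Density of residuals", xlab = "Residual")
abline(v = 0, lty = 2)

# The mean is zero by construction; a negative median indicates right skew
res_mean <- mean(res)
res_median <- median(res)
res_mean
res_median
```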
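Now we can see the same data with a log transformation applied to Asthma. The line of best fit in the scatter plot appears more representative, and the residuals graph supports this. The summary table shows only a modest improvement: R-squared rises from 0.096 to 0.100. The drop in the PM2.5 coefficient from 19.862 to 0.35633 reflects the change of scale from Asthma to log(Asthma), so the two coefficients cannot be compared directly as a measure of fit.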
##
## Call:
## lm(formula = log(Asthma) ~ PM2.5, data = bay_asthma_pm_tract)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.00402 -0.46479  0.03313  0.42298  1.75525
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.69234    0.22840   3.031  0.00248 **
## PM2.5        0.35633    0.02686  13.264  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6566 on 1578 degrees of freedom
## Multiple R-squared: 0.1003, Adjusted R-squared: 0.09974
## F-statistic: 175.9 on 1 and 1578 DF, p-value: < 2.2e-16
Now, a one-unit increase in PM2.5 is associated with an increase of 0.356 in log(Asthma), and 10.03% of the variation in log(Asthma) is explained by the variation in PM2.5.
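Because the response is on a log scale, the coefficient is easier to interpret after back-transforming it. As a small worked example, assuming `log()` is the natural logarithm as in R's default, the coefficient from the summary above converts into a multiplicative change in Asthma. The variable names `b_log` and `pct_change` are only illustrative.

```r
# PM2.5 coefficient taken from the log(Asthma) model summary above
b_log <- 0.35633

# A one-unit increase in PM2.5 multiplies the predicted Asthma rate by exp(b_log)
multiplier <- exp(b_log)

# The same effect expressed as a percentage change
pct_change <- (multiplier - 1) * 100
round(pct_change, 1)
```

Under this model, each additional unit of PM2.5 is associated with roughly a 43% higher predicted Asthma indicator.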
This plot of the residuals of the log model has its highest density concentrated around 0, which indicates that the residuals are more symmetric and that the new model fits better. Compared to the same graph for the original model, the peak has visibly shifted to the right.
Combining the residuals with spatial information, we can see how the accuracy of the model varies by location. A residual close to zero means the modelled regression closely matches the data collected for that tract, so the model could be used to make reasonable estimates there. A positive residual represents an underestimation by the model, while a negative residual represents an overestimation.
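The sign convention follows from the definition of a residual as the observed value minus the fitted value. The sketch below checks this on simulated data. The data frame `sim`, its parameters, and the column `fit_direction` are hypothetical and stand in for the real tract data.

```r
# Simulated stand-in for the tract data (hypothetical values, not real data)
set.seed(1)
sim <- data.frame(PM2.5 = runif(200, min = 6, max = 10))
sim$Asthma <- exp(0.7 + 0.36 * sim$PM2.5 + rnorm(200, sd = 0.65))

model <- lm(log(Asthma) ~ PM2.5, data = sim)

# Residual = observed response minus the model's prediction
observed <- log(sim$Asthma)
predicted <- unname(fitted(model))
res <- observed - predicted

# This matches what residuals() returns
same <- isTRUE(all.equal(res, unname(residuals(model))))
same

# Positive: observed is above the prediction, so the model underestimates
sim$fit_direction <- ifelse(res > 0, "underestimate", "overestimate")
table(sim$fit_direction)
```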
As seen in the map, the area with the lowest (most negative) residual is Stanford’s campus, with a value of -2.004, which means this is where the model overestimates log(Asthma) the most.
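One way to locate the tract with the most negative residual without reading it off the map is to attach the residuals to the data and use `which.min`. This is a minimal sketch on simulated data. The `tract` identifiers and the data frame `sim` are hypothetical, and only the value -2.004 comes from the report.

```r
# Simulated stand-in for the tract data (hypothetical IDs and values)
set.seed(1)
sim <- data.frame(
  tract = sprintf("tract_%03d", 1:200),
  PM2.5 = runif(200, min = 6, max = 10)
)
sim$Asthma <- exp(0.7 + 0.36 * sim$PM2.5 + rnorm(200, sd = 0.65))

model <- lm(log(Asthma) ~ PM2.5, data = sim)
sim$residual <- residuals(model)

# Row with the most negative residual, i.e. the largest overestimate
lowest <- sim[which.min(sim$residual), ]
lowest

# On the log scale, exp(residual) is the ratio of observed to predicted Asthma.
# For the reported residual of -2.004 that ratio is about 0.135.
ratio_reported <- exp(-2.004)
ratio_reported
```

Read this way, a residual of -2.004 means the observed Asthma indicator is only about 13% of what the log model predicts for that tract.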